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Abstract 

Background: The ability to construct biologically meaningful gene networks and modules is critical for 
contemporary systems biology. Though recent studies have demonstrated the power of using gene modules to 
shed light on the functioning of complex biological systems, most modules in these networks have shown little 
association with meaningful biological function. We have devised a method which directly incorporates gene 
ontology (GO) annotation in construction of gene modules in order to gain better functional association. 

Results: We have devised a method, Semantic Similarity-Integrated approach for Modularization (SSIM) that 
integrates various gene-gene pairwise similarity values, including information obtained from gene expression, 
protein-protein interactions and GO annotations, in the construction of modules using affinity propagation 
clustering. We demonstrated the performance of the proposed method using data from two complex biological 
responses: 1. the osmotic shock response in Saccharomyces cerevisiae, and 2. the prion-induced pathogenic mouse 
model. In comparison with two previously reported algorithms, modules identified by SSIM showed significantly 
stronger association with biological functions. 

Conclusions: The incorporation of semantic similarity based on GO annotation with gene expression and protein- 
protein interaction data can greatly enhance the functional relevance of inferred gene modules. In addition, the 
SSIM approach can also reveal the hierarchical structure of gene modules to gain a broader functional view of the 
biological system. Hence, the proposed method can facilitate comprehensive and in-depth analysis of high 
throughput experimental data at the gene network level. 



Background 

High throughput technologies to accurately profile the 
transcriptome have been the major thrust of modern 
systems biology [1,2]. Based on the premise that co- 
expressed genes and/or closely interconnected proteins 
are more likely to be related to a particular biological 
function, researchers have made an effort to identify 
groups of genes, called modules [3], to gain a better 
understanding of the biological system of interest. Early 
attempts in building such gene modules depended pri- 
marily on the co-expression property of genes [4-8]. 
Combining gene expression with protein-protein inter- 
action data greatly expanded the modules and in some 
cases enhanced the functional association of modules 
[9-15]. Although gene expression and protein interac- 
tion information have been used in constructing gene 
modules, these approaches have not taken advantage of 
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the vast amount of knowledge accumulated about gene 
annotation/function. Therefore, these approaches would 
not be expected to be maximally effective in construct- 
ing modules with strong association to biological func- 
tions [16,17]. 

Recently, the concept of semantic similarity, which 
was developed for and used in the field of natural lan- 
guage processing [18-20] has been applied to analyze 
gene ontology terms and used to predict and confirm 
protein functions and interactions [21-24]. Wang et al. 
[25] developed a Gene Ontology (GO) [26] structure- 
based measure to quantify semantic similarity between 
individual terms as well as genes and showed the advan- 
tage of using semantic similarity in organizing complex 
biological terms. 

We postulated that the direct incorporation of seman- 
tic similarity based on GO annotation could significantly 
enhance the construction of biologically meaningful 
gene modules which have strong associations with 
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known biological functions. We have therefore devel- 
oped SSIM (Semantic Similarity-Integrated approach for 
Modularization), to integrate various gene-gene pairwise 
relationships including similarity measures based on GO 
biological process (BP) annotation of genes, gene expres- 
sion pattern, and protein-protein interaction informa- 
tion. The integrated information is then used to group 
genes into modules using affinity propagation [27]. Affi- 
nity propagation is a clustering method that does not 
use pre-selected centers for clustering, instead it gener- 
ates exemplars that best represent a group of data 
points, in this case genes, by considering all similarities 
between pairs of data points and testing all data points 
as potential exemplars. A group of data points that have 
the same exemplar can then be considered to be in a 
cluster. The modules generated by SSIM are found to 
be more significantly and specifically associated with 
biological functions than the results obtained from two 
methods, Module Analysis via Topology of Interactions 
and Similarity Sets (MATISSE) [12] and Interaction 
Component Models for Gene modules (ICMg) [14]. 

Results and discussion 

MATISSE was originally evaluated using the expression 
information of 1990 osmotic stress-associated genes in 
yeast and ~69,000 yeast protein interactions. ICMg was 
tested with a slightly smaller set of osmotic stress-asso- 
ciated genes and using a different set of protein interac- 
tion data (see methods). MATISSE also adapted a 
strategy to generate connected modules by including 
information from genes that are not in the original 
selected gene set, referred to back nodes [12]. ICMg 
focused on the information within the set of selected 
genes only. 

To compare and evaluate the performance of SSIM, 
we analyzed the same datasets used in MATISSE and 
ICMg studies (MATISSE dataset and ICMg dataset). 
Since MATISSE and ICMg used probabilistic 
approaches to optimize the construction of gene mod- 
ules, the inferred connections and number of modules 
could vary in each run. To obtain the number of mod- 
ules for comparison, MATISSE was executed 20 times 
as described [14], which yielded the median number of 
gene modules of 24 for both datasets. ICMg was then 
also executed 20 times with a fixed number of modules 
at 24. For SSIM, we adjusted the preference value to 
obtain the same number of modules (see methods). 

Semantic similarity-integrated approach for 
modularization (SSIM) generates functionally relevant 
gene modules 

GO enrichment analysis with statistical testing such as 
Fisher exact test was the most commonly used approach 
to evaluate the functional association of individual gene 



modules. A low p-value between a gene module and a 
GO term would imply a strong association of the mod- 
ule with the specific biological function represented by 
the term. Since SSIM integrates semantic similarity of 
GO BP (biological process) terms in gene module con- 
struction, it is expected to have a better GO enrichment 
performance compared to other methods. Therefore a 
different annotation scheme, MIPS FunCat [28], which 
is independent of GO, was also used to evaluate the 
functional associations of gene modules. 

For a given significance level, the number of modules 
enriched with at least one annotation term and the 
number of annotation terms enriched in at least one 
module are referred as specificity and sensitivity [13]. 
For each method, the sensitivity and specificity were cal- 
culated and summarized into a measure of functional 
enrichment, an F-measure defined as F = 2 x Sensitivity 
x Specificity/(Sensitivity + Specificity) [13]. Note that 
the ratio of modules enriched with at least one annota- 
tion term (i.e. specificity) might be also expressed as 
precision. The results from SSIM showed better func- 
tional enrichment significance (F-measure) for MIPS 
FunCat annotations (as well as GO and GO BP terms) 
than other methods (Figure 1 and Additional File 1). 
ICMg gave comparable results to SSIM, whereas 
MATISSE showed lower performances in terms of both 
sensitivity and specificity. This could be due to a con- 
straint on the size of the modules (a default parameter 
with no more than 100 genes per module was used in 
this study, but it is adjustable by user) and the addition 
of back nodes (genes that were not in the initial set of 
genes but were later included to make connected gene 
modules) in MATISSE. 

To further evaluate the overall functional enrichment 
performance from the three different methods, semantic 
similarity between the terms and the shortest paths of 
the terms to the root (e.g. GO:0008150, biological pro- 
cess) were investigated (see methods). If enriched terms 
were closely related to each other (coherency) and were 
far away from the root in GO hierarchy (depth), they 




SSIM MATISSE ICMg SSIM MATISSE ICMg 



Figure 1 Functional enrichment analysis results The functiona 
enrichment performance of each method was evaluated using 
three different annotation datasets, all GO terms, GO BP terms and 
MIPS FunCat terms. The results were summarised using F-measure 
(Y-axis, see text). Results for MATISSE and ICMg were obtained using 
the mean values of 20 runs. 
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may reveal more specific and detailed biological func- 
tions. Enriched GO terms associated with the modules 
from SSIM have higher average semantic coherency, 
which measures how enriched GO terms are coherent 
in terms of semantic similarity, and comparable depth 
relative to other methods (see method and Additional 
file 2). 

SSIM produces modules with strong functional 
association and high expression homogeneity 

We also investigated the homogeneity of gene expres- 
sion profiles and topological connectivity of modules 
generated by three different methods using average 
Pearson correlation as well as average clustering coeffi- 
cient [29] of genes within the same module. As shown 
in Figure 2, for both datasets, SSIM and MATISSE 
yielded modules with similar levels of expression homo- 
geneity and topological connectivity while ICMg pro- 
duced modules composed of densely connected genes 
with poorly co-expressed profiles. 

This finding also suggests the possibility of using 
SSIM as a tool to explore gene regulatory networks 
since genes with similar expression profiles are com- 
monly co-regulated [4,7,30-33]. Notably, the study con- 
ducted by Ulitsky and Shamir [12] indicates that 
modules generated by random sampling of genes with 
sufficient network connectivity could give favorable 
topological properties and functional enrichment results, 
but with much lower expression homogeneity. This 
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Figure 2 Expression homogeneity and topological connectivity 
of modules. Expression homogeneity and topological connectivity 
of a module were calculated using average Pearson correlation 
among genes in the module and average clustering coefficient of 
the network generated by the genes, respectively. Average 
expression homogeneity and connectivity over all modules were 
shown for each method. For MATISSE and ICMg, mean and 
standard deviation over 20 runs were taken. 



implies that significant GO enrichment results for gene 
modules might be obtained by chance if we just consid- 
ered topological connectivity of genes; thus, additional 
criteria such as gene expression homogeneity must be 
used to ensure the reliability of functional association 
for gene modules. 

For a given number of modules, the SSIM approach 
seems to generate gene modules with better functional 
association and higher correlations in their expression 
homogeneity. For example, one of the ICMg modules 
(ICMg module 23) shared a large portion of genes with 
two modules generated by SSIM (SSIM module 4 and 
10) (see methods and Additional File 3, Table SI). The 
GO terms enriched in the ICMg module 23 implicated 
two biological functions, "transport" and "lipid biosyn- 
thetic process" while in SSIM, the two functions were 
separated into two different modules, module 4 and 10, 
with much higher functional association (lower p-va\ue) 
and expression homogeneity (Figure 3 and Additional 
file 3, Table S2 and S4). 

SSIM can be used to reveal the hierarchical structure of 
gene modules 

To compare the efficiency of constructing gene modules 
with different algorithms, we have used a fixed number 
of modules (24 modules in this study). In SSIM, a larger 
preference value of affinity propagation [27] means that 
every gene is more likely to be an exemplar of a module 
(module center), which would produce a large number 
of modules composed of highly similar genes in terms 
of integrated similarity and allow us to view the system 
in detail. A smaller preference value would generate 
fewer modules which offer a broader, less detailed over- 
view of the system. This implies that SSIM can also be 
used as a tool to generate a functional hierarchy of gene 
modules by virtue of using semantic similarity. As an 
example, we applied the SSIM approach to a MATISSE 
dataset with a wide range of preference values and 
chose three sets of modularization results with 12, 18 
and 37 modules (see Additional file 4) to illustrate the 
hierarchical structure of gene module generated by 
SSIM. Figure 4A shows that large-size modules obtained 
from a smaller preference value were hierarchically 
decomposed into smaller modules using a larger prefer- 
ence value. As an example of the hierarchical structure 
of gene modules generated with SSIM using different 
preference values, one of the module obtained from the 
"12-module set" representing various "transport" func- 
tions is split into two groups, based on the membership 
of "18-module set", showing slightly different expression 
profiles. One of the two groups from the "18-module 
set" is further divided into two smaller groups based on 
"37-module set", which further stratify the "transport" 
function into protein and ion transport (Figure 4B). 



Cho ef al. BMC Systems Biology 201 1, 5:1 1 7 
http://www.biomedcentral.eom/1 752-0509/5/1 1 7 



Page 4 of 9 



133 experiments 



• •••*•*• • 




GO ID 


GO Term 


P-value 


GO:0006810 


transport 


4.75E-30 


GO:0055085 


transmembrane transport 


1.59E-14 


GO:00 15031 


protein transport 


3.78E-12 


GO:0018193 


peptidyl-aminoadd modification 


3.63E-10 


GO:000S888 


ER to Golgi vesi de-mediated transport 


5.69E-0S 


GO:00 15849 


organic add transport 


1 .OOE-06 


GO:0006817 


phosphate transport 


7.90E-06 


GO:0016192 


vesicle-mediated transport 


1 12E-05 


GO.0006S1 1 


ion transport 


1.25E-05 


GO ID 


GO Term 


P-value 


GO 0006810 


transport 


8.72b-09 


GO:0008610 


lipid biosynthetic process 


1 .26E-06 


GO:0018193 


peptidyl-aminoadd modification 


2.66E-06 


GO;0006696 


ergosterol biosynthetic process 


8.33E-06 


GO.0009100 


glycoprotein metabolic process 


9.57E-05 


GO ID 


GO Term 


P-value 


GO:0016126 


sterol biosynthetic process 


1.03E-16 


GO: 0006694 


steroid biosynthetic process 


5.01E-18 


GO:0006696 


ergosterol biosynthetic process 


2.58E-17 


GO:0008610 


lipid biosynthetic process 


5.26E-17 


GO:0006066 


alcohol metabolic process 


2.17E-15 


GO:0044255 


cellular lipid metabolic process 


1.28E-10 


GO:0043094 


cellular metabolic compound salvage 


1.86E-10 


GO:0006629 


lipid metabolic process 


7.59E-Q9 


GO:0008152 


metabolic process 


1 .03E-07 


GO:0008299 


isoprenoid biosynthetic process 


3.25E-06 


GO:0055114 


oxidation reduction 


2.33E-05 


GO:0006633 


fatty add biosynthetic process 


8.11E-05 



Figure 3 Comparison between modules identified by ICMg and SSIM One of the modules identified by ICMg method (ICMg module) 
shares a number of genes with two separate modules identified by SSIM (SSIM module A and B). Expression profiles of genes in three modules 
were shown on the left and protein-protein interaction network was shown in the middle. Genes shared by ICMg and SSIM modules are 
indicated by pink and light blue nodes, and ones exclusively belong to ICMg, SSIM module A and SSIM module B are depicted by white, red 
and blue nodes, respectively. In the right panel, enriched GO BP terms (p < 1 x 10~ 5 ) and their uncorrected p-values for each module are 
summarized. 



Extension of the module identified by SSIM approach 

Suppose that we have information about a set of genes 
U and thereby choose a subset V as genes of interest to 
be used to construct gene modules. The remaining 
genes (V 0 ), where V UV^ = U and VnV 0 = 0, are gener- 
ally not considered in constructing gene modules. How- 
ever, the genes in v might have many interactions with 
the genes in gene set V and potentially provide impor- 
tant information in representing the biological function 
of the constructed modules. Such information can be 
incorporated by including these genes in using a 
module extension procedure (see methods). As an 
example, one of the modules identified by SSIM (using 
the MATISSE dataset) had 72 genes representing func- 
tions related to "cell cycle". The addition of 55 neigh- 
boring interconnected genes from V that were not in 
the initial gene list V 0 increased the significance level of 
enriched GO BP terms and the average clustering coeffi- 
cient from 0.0648 to 0.2934 (see Additional File 5). 

SSIM can be used to reveal biological processes involved 
in complex disease 

The recently described gene networks involved in prion 
disease model was a significant development in the 
advance of systems biology [34]. It demonstrated the 
value of constructing gene networks to reveal biological 
processes involved in this complex disease. The study 
analyzed gene expression profiling results from several 
different experimental conditions and time points for 
prion disease animal model. A set of roughly 300 



differentially expressed genes (DEGs) that were common 
to different experimental conditions (mouse strains and 
prion types) was identified from these analyses. These 
DEGs were first classified by GO annotations, then 
hand-curated and assigned into different gene networks 
based on functional enrichment results and prior knowl- 
edge of the genes. Additional non-differentially 
expressed genes were also included to complete and 
illustrate the functionality of the networks. It was sur- 
prising to find that the pathogenesis of prion disease 
can be largely explained by only four major networks: 
PrP Sc replication and accumulation, microglial and 
astrocytic activation, synaptic degeneration, and neuro- 
nal cell death [34]. Genes in these four networks are 
involved in proteolysis and lipid metabolism for PrP Sc 
replication and accumulation network, immune 
responses for microglial and astrocytic activation net- 
work, mitochondrial dysfunction and apoptosis for neu- 
ronal cell death network, and transcription-related 
function, intracellular signal transduction and ion trans- 
port-related function for synaptic degeneration network. 

To test the feasibility of using SSIM in complex dis- 
ease analysis, the same set of genes used in the prion 
network construction was fed into SSIM. A total of 16 
gene modules were obtained and they could largely sub- 
sume into the four larger manually generated networks 
previously described (see Table 1 and Additional File 6). 
For example, modules 1 and 3 contain a total of 36 
genes that are highly associated with lipid metabolism 
and proteolysis based on GO BP enrichment result, 
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GO ID GO Term 



GO:0006810 
GO:0055085 
GO:0015031 
GO:0006811 
GO:0015849 
GO:0006826 
GO:0030150 
GO:0006886 
GO:0006888 
GO:0016192 
GO:0006414 
GO:0008645 
GO:0051049 
GO:0Q15892 



transport 

transmembrane transport 
protein transport 
ion transport 
organic acid transport 
iron ion transport 

protein import into mitochondrial matrix 

intracellular protein transport 

ER to Golgi tesicle-mediated transport 

\esicle- mediated transport 

translational elongation 

hexose transport 

regulation of transport 

siderophore-iron transport 



1.78E-0 
2.54E-0 
3.35E-0 
5.13E-0 
7.92E-D 
9.74E-P 




GO:0006810 transport 5.17E-30 

GO:0055085 transmembrane transport 5.78E-16 

GO:0015031 protein transport 2.90E-12 

GO:0006888 ER to Golgi vesicle-mediated transport 1.81E-D8 

GO:0016192 \esicle- mediated transport 1.83E-07 

GO:0006817 phosphate transport 1.26E-06 

GO:0Q15849 organic acid transport 3.57E-06 

GO:0006810 transport 9.26E-34 

GO:0015031 protein transport 3.07E-12 

GO:00S5085 transmembrane transport 2.94E-1 1 

GO:0006811 ion transport 1.72E-09 

GO:0006826 iron ion transport 2.6DE-D9 

GO:0030150 protein import into mitochondnal matnx 2.90E-06 

GO:OQ15849 organic acid transport 7.67E-06 




GO:0006810 transport 5.37E 

GO:0015031 protein transport 6 

GO:0030150 protein import into mitochondrial matrix 1.23E 

GO. 0006886 intracellular protein transport 1.05E 

GO:0Q06839 mitochondrial transport 6.23E 



GO:0006810 transport 2.58E 
GO:0006826 iron ion transport 1.28E 
GO:0006811 ion transport 3.92E 
GO:0055085 transmembrane transport A 
GO:0015892 siderophore-iron transport 1 



Figure 4 Hierarchical structure of modules generated by SSIM approach. Using different preference values, three sets of modules (12, 18 
and 37 modules) were obtained (see Additional file 4) and the membership of genes over the three module sets was expressed as a matrix 
form. When gene / and j belong to the same module and the membership is conserved in any one, two and three sets, the element of matrix 
in /th row and yth column is set to 1 (red, least conserved over the sets of modules), 2 (orange) and 3 (yellow, most conserved), respectively. If 
two genes are not in the same module, the corresponding element in the matrix has a zero value (black). Hierarchical clustering result of the 
matrix was shown in (A). Most orange and yellow squares are subsets of red squares along the diagonal, which means that a large module is 
hierarchically split into several smaller modules according to the change of preference values. (B) As an example, the decomposition of a large 
module (red square) identified in "12-module set" into smaller modules obtainable in "18-" and "37-module set" is shown with the expression 
profiles of genes and enriched GO BP terms. A module representing general transport function (12-module set) is stratified into modules 
specifically related to protein and iron ion transport (37-module set). 
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Table 1 Similarity between modules identified by SSIM and four major prion subnetworks 



Module 


Genes in the module 


PrP Sc accumulation 


Microglial activation 


Neuronal cell death 


Synaptic degeneration 


1 


21 


16 (76%) 


5 (24%) 


2 (10%) 


0 (0%) 


2 


27 


7 (26%) 


21 (78%) 


1 (4%1 


1 (4%) 


3 


15 


12 (80%) 


0 (0%) 


3 (20%) 


0 (0%) 


4 


37 


5 (14%) 


5 (14%) 


10 (27%) 


20 (54%) 


5 


42 


16 (38%) 


14 (33%) 


13 (31%) 


4 (10%) 


6 


33 


1 (3%) 


1 (3%) 


12 (36%) 


24 (73%) 


7 


34 


5 (15%) 


22 (65%) 


9 (26%) 


2 (6%) 


8 


37 


6 (16%) 


0 (0%) 


9 (24%) 


24 (65%) 


9 


49 


2 (4%) 


7 (14%) 


37 (76%) 


6 (12%) 


10 


41 


1 9 (46%) 


1 9 (46%) 


8 (20%) 


2 (5%) 


1 1 


39 


7 (18%) 


29 (74%) 


5 (13%) 


4 (10%) 


12 


43 


2 (5%) 


1 3 (30%) 


1 8 (42%) 


1 2 (28%) 


13 


46 


19 (41%) 


38 (83%) 


2 (4%) 


0 (0%) 


14 


57 


5 (9%) 


36 (63%) 


1 7 (30%) 


7 (1 2%) 


15 


67 


5 (7%) 


2 (3%) 


23 (34%) 


49 (73%) 


16 


60 


5 (8%) 


0 (0%) 


2 (3%) 


54 (90%) 



For each module, the number and the proportion of genes corresponding to prion subnetworks are shown. Since some genes are involved in multiple 
subnetworks, the row-wise summation of proportions may exceed 100%. 



which almost completely recapitulate the key nodes 
involved in PrP Sc accumulation network. In addition to 
the modules related directly to the four reported net- 
works, SSIM also suggested processes involved in tissue 
remodeling such as actin cytoskeleton organization 
(GO:0030036), angiogenesis (GO:0001525), multicellular 
organismal development (GO:0007275) and cell differen- 
tiation (GO:0030154) might also be involved in the pro- 
gression of prion disease. We will conduct more 
detailed network analyses with SSIM using prion disease 
datasets including the comparison between different 
incubation times, host strains, and infectious agents, 
since this preliminary study agrees well with the manu- 
ally constructed networks. 

Conclusions 

In this report, we have proposed and demonstrated a 
new approach, Semantic Similarity-Integrated approach 
for Modularization (SSIM), for inferring biologically 
meaningful gene modules. The SSIM method integrates 
various gene-gene pairwise similarity information 
obtained from gene expression profiling results, protein 
interactions and GO annotations to construct gene 
modules. We showed that gene modules generated by 
SSIM gave higher specificity with stronger association 
with biological functions based on the assessment of 
GO and MIPS annotation terms. 

Since SSIM is based on pairwise similarity values of 
genes, there is room for further improvement by inte- 
grating additional quantitative similarity measures. For 
example, GO semantic similarity of other categories 
(molecular function (MF) and cellular component (CC)) 



or newly developed semantic similarity measures [35-37] 
can also be used. In addition, different clustering algo- 
rithms can also be used to enhance the construction of 
gene modules. The evaluation and comparison between 
various gene pairwise similarities and clustering methods 
will be interesting to explore in the future. 

We have also demonstrated the ability to use SSIM to 
uncover the hierarchical structure of gene modules, i.e. 
hierarchical association and dissociation of modules at 
different levels of functional detail. This would allow us 
to gain a systematic understanding of a given biological 
system. When some genes are excluded from the set of 
selected genes (often differentially expressed genes) due 
to marginal expression changes or for other reasons, the 
module extension procedure adapted in SSIM can incor- 
porate them and increase the functional interpretability 
of gene modules without sacrificing the statistical signif- 
icance of functional association of gene modules. 
Although a similar approach has been used in other stu- 
dies [12,15], the effect of network extension on func- 
tional association of gene modules has not been 
evaluated carefully in those methods. 

Even though tools to construct gene modules have 
been developed, they have been mainly applied in data- 
sets from relatively simple model organisms 
[4,5,9,11,13,14]. The SSIM performed well not only with 
the dataset from yeast, but also with a much more com- 
plicated dataset, the prion disease mouse model. Gene 
modules from SSIM effectively recaptured the manually 
constructed key networks described in prion disease, 
and also revealed new processes that might also be 
involved in the disease development. Our results suggest 
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the SSIM approach can quickly infer gene modules with 
coherent biological meaning and thereby accelerate sys- 
tems biology studies in complex diseases. 

Methods 

Reference datasets and methods 

We used 133 expression profiles representing the osmo- 
tic shock response of Saccharomyces cerevisiae in var- 
ious conditions [38]. Ulitsky and Shamir [12] selected 
1990 genes and curated -69,000 protein interactions for 
MATISSE. Parkkinen and Kaski [14] used 1771 genes 
and -10,000 interactions for ICMg. While 1771 genes 
are a subset of 1990 genes, -10,000 interactions are not 
exactly a subset of -69,000 interactions since they used 
a different approach to curate yeast protein interaction 
data. Both of the datasets were used to compare and 
evaluate the performance of all three different methods. 
We used implementations of MATISSE (Java software) 
and ICMg (R-package) with their default parameters. 
These datasets (MATISSE data and ICMg data) and 
implementations were downloaded at http://acgt.cs.tau. 
ac.il/matisse and http://www.cis.hut.fi/projects/mi/soft- 
ware/ICMg, respectively. 

Similarity measures 

Building gene modules can be viewed as the grouping 
genes of interest according to their pairwise similarity 
values. Pearson correlation between each pair of genes 
(gene i and gene /), was used as expression similarity, e^. 
From the global protein-protein interaction network, 
pairwise topological similarity (ty) was obtained using 
topological overlap matrix [39] which was initially devel- 
oped to identify gene modules in E. Coli metabolism. It 
reaches the maximum value of 1 when there is a direct 
connection between two genes and has the value 
between 0 and 1 when two genes are not linked but 
share some direct neighbors. If there is no direct con- 
nection between two genes and no direct neighbors 
shared by them, t t j has the minimum value of 0. GO 
semantic similarity between two genes (gy) was com- 
puted by the metric proposed by Wang et al. [25] which 
took the characteristics of GO hierarchy into account 
and overcame drawbacks of previous semantic similarity 
measures [18-20]. gy has the value of 0 for two genes 
with no similarity and 1 for genes having identical GO 
annotations. Although GO semantic similarity can be 
computed for three categories of GO terms, biological 
process (BP), molecular function (MF) and cellular com- 
ponents (CC), we used the similarity of GO BP terms in 
this study. An overview of GO semantic similarity can 
be found in Pesquita et al. [40]. 

To combine three similarity measures, they were 
assumed to be independent of each other, which is 
probably a good first approximation since they were 



derived from different information sources. First, for 
expression similarity, the empirical cumulative density 
function, f (£) could be estimated from all e i; values, 
where E = {e,y | i = 1,..., n-1, j = n, and n = total 

number of genes of interest} and the probability of hav- 
ing the similarity less than or equal to e,y, Pr(£<e i; ) 
could be obtained from F(ey). With this method, a large 
e, ; value from highly similar pair of genes has a high 
probability - close to one. The probabilities for ty, Pr 
(T<tij) and gy, Pr(G<gy) could be obtained similarly to 
By. Next, the joint probability was calculated by simply 
multiplying three probabilities in accord with the inde- 
pendence assumption, i.e. Pr(£<e,y, T<ty, G<gy) = Pr 
(£<e/ / )-Pr(r<t // )-Pr(G<^y). This is then used as a compo- 
site similarity measure. 

Clustering method 

Affinity propagation is a clustering algorithm using the 
concept of message-passing [27]. It was modeled using a 
factor graph [41] with two types of messages between 
data points (in this study the points were genes) that 
were derived from the max-sum algorithm in the factor 
graph. First consider all data points as potential cluster 
centers and then find high quality cluster centers and 
their members by updating messages. The detailed 
description of the method can be found in [27,42]. Since 
affinity propagation takes non-positive real-valued simi- 
larities as input, the combined similarity was converted 
into Sij = Pr(£<e /; , T<ty, G<gy)-\, Sye [-1,0]. The only 
parameter affecting the number of clusters is the "pre- 
ference" which represents how "preferable" each data 
point is as a cluster center. We used a globally shared 
preference for all genes and found a proper value for 
the comparison study in order to produce the same 
number of modules as other methods, so that the com- 
parisons could be made. We used authors' MATLAB 
implementations of affinity propagation http://www.psi. 
toronto.edu/affinitypropagation. 

Gene ontology enrichment analysis 

Gene ontology data were downloaded from GO consor- 
tium (http://www.geneontology.org, OBO vl.2 format) 
and MIPS FTP site (ftp://ftpmips.gsf.de/catalogue/anno- 
tation_data, FunCat v2.1 last modified on 5/25/2008). 
Yeast gene association with gene ontology and gene 
information data were obtained from the NCBI reposi- 
tory on 2/17/2010 and 9/5/2010, respectively. P-value of 
enriched annotation term was obtained by one-sided 
Fisher exact test without multiple testing corrections. 
The number of enriched annotation terms in at least 
one module at a given significance level (i.e. p < 1 x 10' 
) and the number of modules enriched with at least 
one annotation term at the level were counted and used 
for computing sensitivity and specificity, respectively. In 



Cho et al. BMC Systems Biology 201 1, 5:1 1 7 
http://www.biomedcentral.eom/1 752-0509/5/1 1 7 



Page 8 of 9 



Additional File 1, these numbers were expressed as 
functions of ^-values {p = [10 1 , 10~ 2 ,..., 10~ 20 ]). 

Coherence and depth of enriched gene ontology terms 

For each method used in the comparison study, 
enriched GO BP terms in ith module having uncor- 
rected p- values less than 1 x 10" were chosen. All pair- 
wise semantic similarities between the terms, gBPjk (j = 
1,..., m r l and k = m it where ffl f = the number of 

chosen terms from ith module) were calculated by 
Wang's approach [25] and the depth of each term, d Bpj 
(j = 1,... m ; ) was obtained using the shortest path to the 
root term (GO:0008150 biological process) in GO hier- 
archy. We could obtain representative coherency and 
depth of enriched GO BP terms for the ith module by 
G BPii = 2lm i {m r l)-I J jLk gBP.jk and D Bpi = l/m r Ej d BPJ , 
respectively (i = 1,..., total number of modules identified 
by the method). Mean values of G BP and D BP over all 
modules were defined as average coherency and depth 
of enriched GO terms and shown in Additional File 2. 

Comparison of the modules identified by different 
methods 

For MATISSE and ICMg, due to their probabilistic nat- 
ure, it is not possible to make a unique assignment of 
genes into modules. Instead, the most likely assignment 
can be identified using consensus matrix of which ele- 
ment at ith row and jth column represents how many 
times gene i and gene /' are grouped into the same mod- 
ule among 20 runs. Using hierarchical clustering of the 
consensus matrix, the assignment of genes into 24 mod- 
ules was obtained and used for the comparison between 
ICMg module and SSIM modules (see text and Addi- 
tional File 3). 

Extension of a module 

A module identified by SSIM can be extended according 
to the following procedure. 

Step 1. find neighbor genes which have direct inter- 
actions with more than one gene in the module, but 
are not in the selected gene set. Sort them in des- 
cending order based on the number of interactions 
with genes in the module 

Step 2. for the top scoring neighbor gene, GO 
semantic similarities between the gene and all genes 
in the module were calculated 

Step 3. if the mean value of the similarities com- 
puted in Step 2 is larger than the mean semantic 
similarity among genes in the module, the neighbor 
gene is added to the module 

Step 4. go back to Step 2 until the genes are all 
classified 



This procedure imposes a priority to the neighbor 
genes having many connections with genes in the mod- 
ule (Step 1) and minimizes the involvement of irrelevant 
neighbor genes by increasing average semantic similarity 
of the module at every addition of a relevant neighbor 
gene (Step 3). Although we only used semantic similar- 
ity between the neighbor gene and genes in the module 
due to the limitation of expression profiles, expression 
similarity can be also added to Step 2 and 3 when gene 
expression of neighbors are available. 

Additional material 
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